-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
DOC: Improve documentation for DataFrame.__setitem__ and .loc assignment from Series #61804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@WillAyd Any thought on this resolution? |
Thanks @niruta25 for the PR
I see that "align" is found 16 times when searching "Intro to data structures" section of the docs. This chapter is only preceded by "10 minutes to pandas" so i'm not sure that the linked issue which states "The current documentation is incomplete and vague about how Series alignment works in assignments." is correct that this fundamental paradigm of pandas is not covered in the documentation. I'm not a member of the documentation team so others may be more positive to these changes, but if I was to review this PR, I would prefer to see more discussion on the issue itself before proceeding to the PR stage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is OK; from a quick glance I didn't see anything in the preceding text that was as clear
doc/source/user_guide/indexing.rst
Outdated
#If you want positional assignment instead of index alignment: | ||
# Convert Series to array/list for positional assignment | ||
|
||
df['positional'] = s1.values # or s1.tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using .values
is typically discouraged, so I think should remove this
doc/source/user_guide/indexing.rst
Outdated
df['positional'] = s1.values # or s1.tolist() | ||
|
||
# Or reset the Series index to match DataFrame index | ||
df['reset_index'] = s1.reindex(df.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit but I think naming the column reset_index
here is distracting; maybe just call s1_values
?
by index labels, not by position. This means: | ||
* Values from the Series are matched to DataFrame rows by index label | ||
* If a Series index label doesn't exist in the DataFrame index, it's ignored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow the difference between this and the line directly following it with the distinction of ignored versus NaN - can you help me understand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Series index labels NOT in DataFrame → IGNORED
df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
s = pd.Series([10, 20, 30, 40, 50], index=['x', 'y', 'a', 'b', 'z'])
df['B'] = s
df
A B
x 1 10
y 2 20
z 3 50
# Values for 'a' and 'b' are completely ignored!
- DataFrame index labels NOT in Series → NaN
df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=['x', 'y', 'z', 'w'])
s = pd.Series([100, 300], index=['x', 'z']) # Missing 'y' and 'w'
df['B'] = s
df
A B
x 1 100.0
y 2 NaN
z 3 300.0
w 4 NaN
# Rows 'y' and 'w' get NaN because they're missing from Series
Added the ignored example to documentation.
pandas/core/indexing.py
Outdated
**Assignment with Series** | ||
|
||
When assigning a Series to .loc[row_indexer, col_indexer], pandas aligns | ||
the Series by index labels, not by order or position. This is consistent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove the line This is consistent with pandas' general alignment behavior.
pandas/tests/frame/test_api.py
Outdated
@@ -378,3 +378,22 @@ def test_inspect_getmembers(self): | |||
# GH38740 | |||
df = DataFrame() | |||
inspect.getmembers(df) | |||
|
|||
def test_setitem_series_alignment_documentation(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely appreciate you adding tests, but since this is a documentation change you shouldn't need to add anything here. Feel free to remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok! Will generate another PR to add these tests in. Would be good to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#61822 for follow up for adding tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there an original discussion on tests being required? From a glance at them I would think our existing test base already covers those use cases, but maybe I am overlooking something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No the issue does not mention anything about the missing test cases, but I haven;t seen the two cases I mentioned is covered. Hence the addition.
I addressed your comments, please let me know how's that look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd How do you find this changes post addressing your comments?
doc/source/user_guide/indexing.rst
Outdated
#If you want positional assignment instead of index alignment: | ||
# Convert Series to array/list for positional assignment | ||
|
||
df['positional'] = s1.tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should generally encourage the tolist approach - that may have some rather big performance implications to convert each series element to a Python object.
I like what you are trying to do here, but I would just keep with the reindex approach in the second example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for the late reply. Thanks for the ping
pandas/core/frame.py
Outdated
|
||
Parameters | ||
---------- | ||
key : str, list of str, or tuple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that the documented types are correct. Using pandas-stubs as a reference, it looks like this isn't typed at all, which probably means its more complex than documented here:
I'd just remove this - I think its better to be incomplete than incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick question! Since parameters is required for validating doctoring, what do you suggest should be mentioned for untyped key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I'm not sure what the best way to workaround that is, but I'd be OK if you just removed the Parameters section entirely. Can look further into it later if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or instead of specifying types you can say something very generic like:
key : The object(s) in the index which are to be assigned to
...
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could not drop Parameter section altogether cuz doctoring validation was failing. But the second option worked.
Addressed.
Let me know your thoughts now!
I do not have access to merge this PR. Can you please help. |
Thanks @niruta25 |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.The core problem is that when assigning a Series, pandas aligns on index and values in the Series that don't match an index label will result in NaN DOC: Improve documentation for DataFrame.setitem and .loc assignment from Series · Issue #61662 · pandas-dev/pandas, but this behavior is poorly documented.
My proposed solution addresses the issue comprehensively by:
The fix emphasizes that pandas performs index-based alignment rather than positional assignment, which is the source of confusion for many users. The documentation will now clearly explain that when you assign a Series to a DataFrame column, pandas matches values by index labels, not by position, and missing labels result in NaN values.
This solution follows pandas' documentation conventions and provides both reference documentation and practical examples that will help users understand and correctly use this important feature.